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We present a cross-language C++/Python program for simulations of quantum mechanical sys- 
tems with the use of Quantum Monte Carlo (QMC) methods. We describe a system for which to 
apply QMC, the algorithms of variational Monte Carlo and diffusion Monte Carlo and we describe 
how to implement theses methods in pure C++ and C++/Python. Furthermore we check the effi- 
ciency of the implementations in serial and parallel cases to show that the overhead using Python 
can be negligible. 



I. INTRODUCTION 



II. THE SYSTEM 



In scientific programming there has always been a strug- 
gle between computational efficiency and programming 
efficiency. On one hand, we want a program to go as 
fast as possible, resorting to low-level programming lan- 
guages like FORTRAN77 and C which can be difficult to 
read and even harder to debug. On the other hand, we 
want the programming process to be as efficient as pos- 
sible, turning to high-level software like Matlab, Octave, 
Maple, R and S+. Here features like clean syntax, in- 
teractive command execution, integrated simulation and 
visualization and rich documentation make us feel more 
productive. However, if we have some well tested and fast 
routines written in low-level language, interfacing these 
routines with, e.g., Matlab is rather cumbersome. Most 
often, we will end up using similar Matlab routines, which 
are often written as generally as possible at the cost of 
computing efficiency. 

Recently, the programming language Python [1] has 
emerged as a potential competitor to Matlab. Python 
is a very powerful programming language which, when 
extended with numerical and visual modules like SciPy 
[2], shares many of the features of Matlab. In addition, 
Python was designed to be extendible with compiled code 
for efficiency and several tools are available for doing so. 
In this paper we will demonstrate how Python can be ex- 
tended with compiled code to yield an efficient scientific 
program. Specifically, we will start with a Monte Carlo 
simulator written in C++ and, with the help of SWIG 
[3], reuse the C++ code in a Python Monte Carlo sim- 
ulator. We will show that this porting from low-level to 
high-level code can be achieved without significant loss 
of efficiency. 

The remainder of this paper is organized as follows. In 
section II we define the system we apply the Monte Carlo 
simulator to. Section III discuss in some detail the algo- 
rithms we are going to use, i.e., variational Monte Carlo 
and diffusion Monte Carlo. Next, we go through the im- 
plementations of diffusion Monte Carlo, both in C++ and 
Python, in section III. Furthermore, section V compare 
the efficiency of CH — h and Python for varying numbers of 
CPUs and section VII visualize the output from diffusion 
Monte Carlo with the use of Python. Finally, we round 
off with some remarks in section VIII. 



Quantum Monte Carlo (QMC) has a wide range of appli- 
cations, for example studies of Bose-Einstein condensates 
of dilute atomic gases (bosonic systems) [4] and studies 
of so-called quantum dots (fermionic systems) [5], elec- 
trons confined between layers in semi-conductors. In this 
paper we will focus on a model which is meant to repro- 
duce the results from an experiment by Anderson & al. 
[6]. Anderson & al. cooled down 4 x 10 6 87 Rb to temper- 
atures in the order of 100 nK to observe Bose-Einstein 
condensation in the dilute gas. Our physical motivation 
in this paper is to model numerically this fascinating ex- 
periment. This should be done in an as general as pos- 
sible way, so that we can expand our computations to 
systems not yet explored in experiments. We will in this 
section go through the steps needed to put the experi- 
ment into the framework of QMC. 
In QMC the goal is to solve the Schrodinger equation 
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or rather the time independent version 
ff*(R) = £*(R). 



(1) 



(2) 



Thus, to model the experiment above using Quantum 
Monte Carlo methods, all we need is a Hamiltonian and 
a trial wave function. The Hamiltonian for N trapped 
interacting atoms is given by 
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Taking advantage of the fact that the gas is dilute, we 
can describe the two-body interaction Vi„t(|rj — rj|) by a 
hard-core potential of radius a, where a is the scattering 
length, thus treating the atoms as hard spheres [7]. 
Wc define the trial wave function by 



*T = l[g(r l )l[f(r lJ ), 



(4) 



l<3 



where g(Ti) describes the interaction between one parti- 
cle and the external potential, V ex t, while the two-body 
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correlation function f{rij) describes the interaction be- 
tween two particles. The function /(r^) is the solution 
of the Schrodinger equation for a pair of atoms at very 
low energy interacting via a hard-core potential of radius 
a. The ansatz for f(r) reads 



f(r) 



(1 - a/r) 




r > a 
r < a. 



(5) 



Besides being physically motivated, this type of corre- 
lation has been successfully used in rcfs. [8] and [4] to 
study both spherically symmetric and deformed traps. 
In the experiment of Anderson & al., the particles were 
trapped in a disk-shaped harmonic oscillator potential. 
This corresponds to using an external potential 

Vext = y (uj±x 2 + u±y 2 + u z z 2 ) (6) 

If we neglect the particle-particle interaction and insert 
the potential of cq. (6) into eq. (2) we obtain 

g(r) = A(a)\ 1 / i exp (-a(x 2 + y 2 + \z 2 )) (7) 

where a is taken as the variational parameter of the cal- 
culation and A(a) = (2a/7r) 3//4 is a normalization con- 
stant. The parameter A = lu z /luj_ is kept constant and 
set equal to the asymmetry of the trap. Still following 
Anderson & al. we let A = y/8 throughout this paper. 



III. THE ALGORITHMS 

In this section we will go through the algorithms used in 
this paper. First, we will discuss Monte Carlo integration 
in general. Second, we state the Variational Monte Carlo 
algorithm and last, we go through the Diffusion Monte 
Carlo algorithm. 



A. Monte Carlo Integration 

Monte Carlo integration is best described through con- 
ventional numerical integration methods. In conven- 
tional methods we fix the evaluation points and weights 
of the integrand in advance, 
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If we choose the evaluation points with equal spacing over 
the integration area, a two dimensional integral can for 
example be written as 

pi pi -.mm / 1 \ 

(9) 

For N dimensions we have to carry out m N evaluations 
of f(r) to obtain an accuracy of O (^). In Monte Carlo 



integration the integrand is evaluated at random points 
Yi drawn from an arbitrary probability distribution p(r), 

p p m 

/ /(r)dfi= / g{v)p{v)dn = Y,9{ri) + 
Jn Jn j=1 

(10) 

The main advantage of this scheme is that it is indepen- 
dent of the number of dimensions of R. Note, however, 
that the efficiency of the integration depends on a good 
choice for p(R). 



1. Metropolis Algorithm 

The Metropolis Algorithm [9] generates a stochastic se- 
quence of phase space points that samples a given prob- 
ability distribution. In Quantum Monte Carlo meth- 
ods each point in phase space represents a vector R = 
{ri, r2, . . . , r/v} in Hilbert space. Here rj represents 
all degrees of freedom for particle i. Coupled with a 
quantum mechanical operator (like the Hamiltonian in 
eq. (3)) each such point can be associated with physi- 
cal quantities (the Hamiltonian gives the energy of the 
system). The fundamental idea behind the Metropolis 
algorithm is that the sequence of individual samples of 
these quantities can be combined to arrive at average val- 
ues which describes the quantum mechanical state of the 
system. The Metropolis algorithm provides the sample 
points. From an initial position in phase space a pro- 
posed move is generated and the move is either accepted 
or rejected according to the Metropolis algorithm 

P A (R',R)=mm(l,^j. (11) 

where Pa(R', R) is the probability of moving the particle 
from R to R'. In this way a random walk generates a 
sequence {Ro, Ri, . . . , R,, . . .} of points in phase space. 
It is important that all points must be accessible from 
any starting point; the random walk must be ergodic. 
Metropolis & al. showed that the sampling is most easily 
achieved if the points R form a Markov chain. A random 
walk is Markovian if each point in the chain depends only 
on the position of the previous point. 



B. Variational Monte Carlo 

Variational Monte Carlo (VMC) is the starting point of 
all Monte Carlo calculations in that we need an optimized 
trial wave function as input to the other Monte Carlo 
methods. In VMC we combine Monte Carlo integration 
with the Metropolis algorithm and the variational prin- 
ciple to get the trial wave function that yields the lowest 
energy. 

The variational principle states that, given a variational 
wave function ip a (where a — a\, a 2 , ■ ■ ■ denote a set of 
variational parameters), the energy expectation value of 
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ip a provides an upper bound to the ground state energy, 
i.e., 

{ H )a ={^ H ^ m >(H). (12) 



Rewriting eq. (12), 



with the local energy 



(14) 



we can interpret the square of the wave function divided 
by its norm as the probability distribution of the system, 
arriving at 

(H) a = J E L (R)p(R)dR (15) 



where 



p(R) = 



|V^(R)| J 



J|V Q (R)| 2 dR' 



(16) 



We then carry out the integral of eq. (15) with Monte 
Carlo integration. We move a walker randomly through 
phase space according to the Metropolis algorithm, and 
sample the local energy with each move. This way we 
get a statistical evaluation of the integral. 
The VMC algorithm consists of two distinct parts. First, 
as the Metropolis algorithm reproduces the probability 
distribution in the limit t — > oo, you need a thermal- 
ization, where the walker propagates according to the 
Metropolis algorithm in order to equilibrate it to the 
probability distribution p(R). Second, you continue to 
move the walker, but you sample energies and other ob- 
servables for computation of averages and other statisti- 
cal observables. 

In algorithm 1 the particles are moved one by one and not 
as a whole configuration[20]. This improves the efficiency 
of the algorithm for larger systems, as moving the whole 
configuration requires decreasing the steps to maintain 
the acceptance ratio [10]. 



C. Diffusion Monte Carlo 

In Diffusion Monte Carlo (DMC) we seek to solve the 
Schrodinger equation in imaginary time. This involves 
Monte Carlo integration of a Green's function. As the 
Green's function is approximated by splitting it up in 
a diffusional part (which has the form of a Gaussian) 
and a branching part we also need a Gaussian random 
generator and a way to create and destroy walkers. 



1. Basic Ideas of DMC 

The basic ingredients of DMC are [11]: 

1. It considers the Schrodinger equation in imaginary 
time, 



dd^{R, t) 
dt 



[H-EWR,t), 



(17) 



where R represents the set of all coordinates. The 
formal solution of (17) is 



V>(R,i) = e-[ ff - £ l t V(R,0), 



(18) 



where exp[— (H — E)t] is called the Green's func- 
tion, and E is a convenient energy shift. 

2. The wave function is positive definite everywhere, 
as it happens with the ground state of a bosonic 
system, so it may be considered as a probability dis- 
tribution function. (This assumption leads to diffi- 
culties when we consider fermionic systems, where 
the wave functions are anti-symmetric and special 
care needs to be made.) 

3. The wave function is represented by a set of ran- 
dom vectors {R\, R2, • • • , Rm}, in such a form that 
the time evolution of the wave function is actually 
represented by the evolution of the set of walkers. 

4. The actual computation of the time evolution is 
done in small time steps r, and the Green's function 
is approximated accordingly, 



-[H—E]t _ 



IT 



-[H-E]t 



(19) 



where r = t/n. 



5. The imaginary time evolution of an arbitrary start- 
ing state ip(R, 0), once expanded in the basis of 
stationary states of the Hamilton operator 



V>(R, 0) = C„<f> n u(R) 



(20) 



is given by 

V(R,i) = ]T e - [ ^- £lt a<MR), (21) 

in such a way that the lowest energy components 
will have the largest amplitudes after a long elapsed 
time, and in the t — -> 00 limit the most important 
amplitude will correspond to the ground state (if 
Co 96 0) [21]. 

6. An improvement of this scheme is the introduction 
of importance sampling. 
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Algorithm 1: VMC algorithm 

Generate initial randomized configuration 
for to VMC steps 

for to particles 
Propose move r — > r' 
Compute ratio w — \ip(R')/ipCR)\ 2 

^4ccep( or reject according to the Metropolis probability min(l,w) 
Sample the contributions to the local energy and other observables in this move 
Sample the contributions to the local energy and other observables for this series of movements 
Do statistics and variate the parameters in wave function according to the statistics 



The scheme is quite simple; once we have found an appro- 
priate approximation for the short-time Green's function 
and determined a starting state, the job consists in rep- 
resenting the starting state by a collection of walkers and 
letting them evolve in time, i.e., obtaining a collection of 
walkers from the old collection of walkers, up to a time 
large enough so that all other states than the ground 
state are negligible. 



2. Shortime Green's Function 

In coordinate representation the Green's function is given 
by the matrix element 

G(R',R,t) = (R'|e- [H - £lt |R), (22) 

and the time evolution equation is 

^(R,t) = /G(R',R,i)V(R,0)dR. (23) 

At t = the value of the Green's function is 

G(R',R,0) = (5(R'-R). (24) 

From the operatorial representation of the Green's func- 
tion we can easily obtain the formal differential equation 



or written out in the coordinate representation, 



(25) 



dG(R',R, t) 

Ft 



% i<j 
G(R',R,t) 



with the boundary condition (24). 

The Hamiltonian, eq. (3), can be rewritten as 



H 



-DV 2 + V(R) =K + V, 



(26) 



(27) 



where D = h/2m is a diffusion constant, K is the kinetic 
energy operator and V is the potential energy operator. 



The Green's function can now be written as G(R', R, t) = 
(R'|e~[' ff+y_ ' E l t |R). The main problem in obtaining the 
Green's function is that K and V does not commute. 
The Green's functions related exclusively to the kinetic 
operator and the potential operator is, however, readily 
determined. 

The kinetic operator can be written as 



G K (R',R,t) - j2~jW J 



e" ikR e- Dtk2 e ikR dk, (28) 



If we carry out the integral of eq. (28) we get the Green's 
function 



G^(R', R> t) 



1 



{AnDt) 3N / 2 



-(R-R') 2 /4Dt 



(29) 



It is easily checked that this Green's function is the Dirac 
delta function 6{R — R') in the t — limit. 
The Green's function related to the potential is even sim- 
pler, for momentum independent interactions. The po- 
tential is then a local operator and the Green's function 

G y (R',R,t) = e- y(R) ^(R-R'). (30) 

Approximate forms for the full Green's function are ob- 
tained by using the following approximations (with time 
step t) [11]: 



(R'|e- (K+V/)T |R) = (R'\e- KT \K"){Il"\e- VT \Il) +0(t 2 ) 
= (R'|e- yr |R")(R"|e- KT |R) +0(t 2 ) 
= (R'|e- Vr / 2 |R")(R"|e- Kr |R"')x 
(R"V W/2 |R) +0(t 3 ). 

(31) 

In all cases an integration over the internal coordinates 
R" and R'" must be understood, being easily carried 
out by using the delta functions appearing in Gy. The 
practical equations are 

G(R', R, r) = G K (R', R, t)^ e - v ™ t + 0(r 2 ) 

^^ e - v ^ t G k (R',R,t)+0(t 2 ) 

= e [ £ -W R ')+^ R )}/ 2 ^GK(R', R, r) + G(r 3 ). 

(32) 
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3. Importance Sampling 

An important improvement to the DMC scheme above 
is, as mentioned above, the use of importance sampling. 
In problems with singularities in the potential (e.g., the 
Coulomb potential) the Green's function exp[— (H — E)t] 
will reach unbounded values, leading to an unstable al- 
gorithm. Even without singularities the scheme above is 
inefficient. This is due to the fact that we have imposed 
no restrictions as to where the walkers will walk. 
Consider the imaginary time evolution of the quantity 



/(R,i) = V>T(R)V(lM), 



(33) 



where ip is the wave function satisfying the Schrodinger 
equation and ipT is a time independent trial function, 
preferably close to the exact ground state wave function. 
It is not necessarily the starting wave function, even if it 
normally also is taken to be the starting wave function. 
The time evolution of /(R, t) is given by inserting ip T in 
the Schrodinger equation: 

_9V^)^ (R) = _ Z?(V ^(R ) f ) ) V , T( R) + ^ 

(V - E)iP(K,t)i> T (Il) 

with D = h 2 /2m. As 

V 2 {M) = (V 2 V#t + iMVVt) + 2(V^)(V^ T ) (35) 
we have 

~% = ' D ^ 2 f+ 2D ^^^T)+DtP(W 2 ^ T ) + (V-E)f. 

(36) 

Introducing the so called drift force (which is a drift ve- 
locity), given by 



F = — V^r, 
Wt 



and using that the local energy is 



E L = -D 



1p T 



+ V 



(37) 



(38) 



we end up with the (imaginary) time evolution of / as 

= -DV 2 f + DV ■ (Ff) + (E L - E)f. (39) 

The transformed Hamilton operator working on / in cq. 
(39) may be written as a sum of three terms 



H = K+F+L 
K = -DV 2 

F = -£>(V • F(R)) + F(R) • V 
L = E L (R) 



(40) 



corresponding respectively to the kinetic part, the drift 
part and the local energy part. 



An 0(t 2 ) approximation of the Green's function is given 
by [12]: 

(R'|G|R) =- L-_ e -[tt'-*-^F(*)] 2 /4Dr x 

V 1 1 ' (4itDt) 3N / 2 (4i) 

e Er-[E L (R')+E L (R)]T/2 + 0{ T 2 ). 

while an C(r 3 ) approximation the Green's function is 
obtained from [13] 

G = e ET e- L ' 2T e- F ' 2T e- KT e- F ' 2T e- L ' 2T + 0(r 3 ). (42) 



4. DMC Algorithm 

In algorithm 2 we state the DMC algorithm correspond- 
ing to eq. (41). The algorithm corresponding to cq. (42) 
is similar except that the move is split into four parts due 
to the splitting of the drift operator. £ in the move part 
of algorithm 2 is drawn from the mul tivari ate Gaussian 
distribution with null mean and a = V2Dt, the solution 
of the kinetic Green's function. 

It can be showed [10] that importance sampling may 
be used in Variational Monte Carlo as well, using the 
Fokker-Planck formalism. The resulting algorithm is 
identical to alg. 2, but without the branching term. We 
will therefore concentrate on diffusion Monte Carlo in the 
rest of this article. 



IV. THE IMPLEMENTATIONS 

In the previous sections we have identified a physical sys- 
tem to simulate and found algorithms to use in the simu- 
lations. One important question that remains is how we 
implement the system and the algorithms. In this section 
we will propose three different approaches. They all use 
the same algorithms, they solve the same systems, with 
identical results, and they are all written in an object 
oriented way. In fact, most of the code is the same for all 
three approaches. The only difference in the implemen- 
tations is the amount of time spent in low level, compiled 
language (represented by C++) versus time spent in high 
level, interpreted language (represented by Python). The 
assumption is that compiled code is faster while inter- 
preted code is clearer, easier to debug and easier to ex- 
pand. We will in this section go through a pure C++ im- 
plementation, a straight forward Python approach and a 
more involved Python approach. 



A. C-\ — |- implementation 

The base of our implementations is a serial diffusion 
Monte Carlo (DMC) solver written in C++. The Python 
solvers are both heavily based on this code. We will 
therefore first go through the C++ implementation of 
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Algorithm 2: DMC Algorithm 



Generate an initial set of random walkers with the Metropolis Algorithm 
for to time 

for to N watk ers 

Diffusion: 

for to particles 

propose move r' = r + DrF(r) + £ 

Branching; calculate replication factor: 

n = mt(exp{r(£ L (R)/2 + E L (R')/2 - E)}) 

if n = 

Kill the walker 

if n > 

Allow the walker to make n — 1 clones 
Remove dead walkers, and make new clones 
Check walker population and adjust trial energy 
sample contributions to observable 




gctLocalEncrgy(Func& local_c, 
bool update) 



gctLocalEncrgy(Func& wf, 

Func& nabla2, 
Func& potential, 
Func& update) 



pperator-(Walkcr& c]nnp _mc) 



+ functions for getting F_q and 
greens function, marking a walker 
as dead etc. 



Private functions and variables, 

eg: 

double *particles 

CD 



FIG. 1: Class diagram of DMC 



DMC. In figs. 1 and 2 we present the class diagram and 
float diagram of the C++ implementation. 
In fig. 1 we show three classes, class DMC, class Func 
and class Walker. 

• The class DMC contains the DMC algorithm, im- 



plemented in the function difiMCQ (and helper 
functions to clean up the code). It also contains 
a pointer to the class Func and an array of walker 
objects (or just walkers). 

• The class Func contains functors, i.e., classes whose 
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only purpose is to receive a set of numerical val- 
ues and transform these to numerical output (not 
unlike mathematical functions). Specifically, Func 
contains different wave functions (with correspond- 
ing analytic local energies and quantum forces if 
implemented) along with generic functions for the 
gradient and Laplace operator. The different func- 
tions of the systems are subclasses derived from 
general functions to ensure that the functions of all 
the systems have the same input and output. 

• The class Walker contains all the physical informa- 
tion of a walker, that is, its position in phase space 
(and function for setting and getting the position) 
and functions for getting physical values like the 
energy of the walker and the wave function of the 
walker. 

The advantage of this division of the program is quite 
clear. The class DMC contains the DMC algorithm 
and may easily be replaced with other Monte Carlo al- 
gorithms, like the already mentioned variational Monte 
Carlo, Green's Function Monte Carlo and so on. These 
replacements will neither affect the systems implemented 
in class Func nor the physical information of the walk- 
ers. Likewise, new systems may be implemented without 
changing the code of the algorithm[22]. The wave func- 
tion and the potential (or optionally an analytic expres- 
sion of the local energy and the quantum force) are sent 
to the walkers as pointers to Func objects and are as such 
not known to the walkers at compile time. 
Fig. 2 shows the float diagram of the DMC program. 
The algorithm is divided into functions so that, e.g., the 
function diffMCQ contains a loop calling the function 
oneTimeStep(), which in turn loops over oneMonteCar- 
loStep() and so on. Each such function is represented by 
a box in the float diagrams. 

Looking at the float diagram, fig. 2, it is easy to re- 
alize that most of the time of computation is spent in 
the bottom boxes of the diagram. When implementing 
the DMC in Python the bottom boxes should be kept 
in C++ while only diffMCQ (which is in broad lines the 
hole DMC algorithm) will be in Python code. 

1. Checkpointing 

And aspect which is frequently forgotten when writing 
a scientific program is the aspect of checkpointing. A 
Monte Carlo simulation may easily take several days, 
or even weeks and months. This would be unfeasible 
without some way to stop and start the simulation in 
case of computer crashes, power losses or overeager com- 
puter managers. In checkpointing we store all informa- 
tion needed to resume the computation at given steps 
of the simulation. The challenge is to identify the steps 
at which the checkpointing should be made. The check- 
points should be made frequently enough to save time 
compared to starting all over, but not so frequently that 



the simulation is significantly slowed down. 
In variational Monte Carlo it suffices to write a new ini- 
tialization file where the number of steps is reduced to 
what is remaining of the original number of steps and 
a random seed so that we continue the random stream 
we have started. The latter is important if we want to 
get reproducible results. As the amount of data stored 
in a checkpoint is so small, we can do it after every step 
without reducing speed. However, making a checkpoint 
during the movement of the particles would be quite cum- 
bersome and the amount of data needed to store the 
checkpoint would increase dramatically. 
Again, diffusion Monte Carlo is more challenging. As 
generating a starting state in effect takes a variational 
Monte Carlo run, we have to store all the walkers at 
every checkpoint. This involves storing all particle posi- 
tions, the last calculated local energy and quantum force 
(which is a vector) and so on, for every walker. In the 
C++ implementation this is realized by the functions 
gctBuffer and setBuffer in the Walker objects. In a call 
to gctBuffer the walker puts all it's information into a 
character array. In a checkpoint these arrays are con- 
catenated and dumped to file. When restarting the pro- 
gram, the arrays are read from the file and sent to the 
walkers through setBuffer. The checkpoints are, as for 
variational Monte Carlo, made after every time step. 

2. Generating random numbers 

When all is said and done the most important function 
in a Monte Carlo method is the random number genera- 
tor. The Monte Carlo integration depends on a walker's 
ability to reach all points in phase space from its starting 
point. If the random numbers determining the movement 
of the walker are in some way correlated, the walker will 
lose this ability. A good random number generator is 
therefore of great importance. Consider the simple one- 
dimensional definite integral 

F= f f(x)dx (43) 
Jo 

To solve this equation numerically, we approximate F in 
terms of Fn- 

F= lim F N (44) 

JV^oo 

where 

I N 

i=l 

When we solve eq. (43) using Monte Carlo integration, 
we draw the sample points {xt} randomly from a given 
probability density function. However, as a computer 
only has a finite sized set of numbers available, we have 
to use random numbers generated from a pseudo-random 



input 



I 



diffMC 



I 



set initial position 
adjust with VMC (without parameter variation) 
find trial energy 



for(int i; i ! =termalization; i++) { 
oneTimeStep (ran, i) ; 

adjust time step length} 
reset energies 

for(int i; i!=steps; i++) 
oneTimeStep (ran, i) ; 

do statistics 

for(int k=0; k ! =M; k++) 
oneMonteCarloStep (ran, k) 




destroy dead walkers 
adjust trial energy and accumulate energies 



find e_local_x 



for(int i=0; i ! =particles ; i + +) { 
find wf_x and f q_x 
move particle i 
find wf_y 
diffuse ( ran, k, wf_x, wf_y ) ; } 

branch (ran, k, e_local_x) ; 



diffuse : 

find w= (wf_y/wf_x) (G (y, x, tau) /G (x, y, tau) ) 
accept or reject according to min(l,w) 



branch : 

n=int (exp (- ( . 5* (e_local_x+e_local_y) -e_trial) *tau) 
n=0->kill walker 
n>0->for (n-1) { 
copy walker} 



FIG. 2: Float diagram of DMC 
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number generator (PRNG). For every PRNG there is a 
finite number of pseudo-random numbers, known as the 
cycle length of the PRNG. When this cycle length is 
reached eq. (44) will cease to converge. This may not 
seem like a serious problem as the cycle length can be 
made quite large by using better PRNGs. However, we 
have to take care to choose a good PRNG. To take an 
example, the PRNG ranO (see [14]) has a cycle length of 
about 2.1 x 10 9 . On a 3.40GHz Intel(R) Xcon(TM) CPU 
ranO takes 40 seconds to run through one cycle. It is 
obvious that using this (widely used) PRNG will lead to 
problems when a Diffusion Monte Carlo simulation takes 
several days of CPU time. 

The generation of random numbers is a science in itself 
and, though of great importance to Monte Carlo meth- 
ods, we will not go through this aspect in detail. We 
can advise the interested reader to read the introduction 
of [15]. In the simulations in this paper we have used a 
64-bit linear congruential generator with prime addend 
[16, 17] which has a period of 2 64 . Linear congruential 
generators may have correlations between numbers that 
are separated by a power of 2. We should therefore take 
care to avoid using this generator in batches of powers of 
2. In our case, this happens in all 3D simulations with 
variational Monte Carlo. In algorithm 1 we make 3 calls 
to the random generator to move one particle, then one 
call to determine whether the move should be rejected 
or accepted. This is repeated for every particle. In our 
case, this can be avoided by letting the acceptance/re- 
jection procedure use another random stream than the 
movement and making sure these streams are uncorre- 
cted with each other. 



B. Parallelizing the C-\ — |- implementation 

To parallelize Variational Monte Carlo (VMC) is em- 
barrassingly easy. As long as you ensure that all the 
calculations use different sets of random numbers (and 
thereby ensuring that the calculations are uncorrelated) 
the algorithm is parallelized by running an independent 
calculation on each node. The communication between 
the nodes is restricted to spreading the input parameters 
before the calculations and collecting the output after 
calculation. The parallel efficiency is essentially 100%, 
and the calculation can theoretically use any number of 
nodes without efficiency loss. 

The parallclization of Diffusion Monte Carlo (DMC) is 
more cumbersome. This is due to the branching part 
in algorithm 2 where walkers are killed or reproduced. 
If we had parallelized DMC in a straight forward way, 
i.e., by starting one DMC run per node with different 
sets of random numbers and collected the results at the 
end, the walkers would be unevenly distributed among 
the processes, leading to an inefficient DMC code. For 
a DMC code to function properly it needs an as large 
as possible number of walkers to get a good representa- 
tion of the wave function. A lot of unconnected DMC 



simulations will basically yield a set of not-so-good wave 
functions. We therefore have to collect all the walkers, 
remove dead walkers and make copies of the more vir- 
ile walkers according to the branching process and then 
redistribute the walkers at every time step. 
The parallelization is realized by a division of the walker 
array. A master node stores an array of the full number 
of walkers and distribute these walkers evenly between 
the slave nodes where the walkers are stored in smaller 
walker blocks. The preparation to sending and receiving 
the walker blocks is identical to the checkpoint procedure 
mentioned above, sans the file writing and reading. In 
fact we use the MPI_Pack procedure to pack the walkers 
for checkpointing, even in the serial program. The only 
difference is that we send and receive the walkers instead 
of writing to and reading from file. 

The main problem left is then to ensure that the sets of 
random numbers in fact are independent. 



1. Generating random numbers in parallel 

Generating random numbers in parallel is not as straight- 
forward as one may think. A common first approach is to 
start the same random generator on every node, varying 
the seed with the rank of the node as a factor to get 
independent streams and hoping that these streams are 
uncorrelated. The main problem with this approach is 
that random generators often have long-term correlations 
which is of little importance in the serial case, but may 
appear as short-term correlations in a parallel case [16, 
17]. In the extreme case, we may chose seeds yielding 
random numbers separated with exactly one cycle. In 
this case we will end up with Nqpu identical streams, 
yielding Nc pu identical simulations and extremely good 
(but wrong) statistics in the results. Several approaches 
to get safe streams in parallel are suggested in [16, 17] 
and implemented in the SPRNG library which we use in 
our simulations. 



C. Python implementation I 

The C++ implementation uses about 90% of the time in 
the walker objects and most of this time in computing 
local energies (A 3 operations where N is the number of 
particles) in functions located in Func. In the python 
implementation of diffusion Monte Carlo (pyDMC) the 
classes Walker and Func are therefore linked into a shared 
library readable from Python, through a thin wrapper 
module, together with the functions from the DMC class 
below the function oneTimeStepQ in fig. 2. 
The main obstacle in implementing pyDMC is the han- 
dling of the walkers. In a straight forward approach we 
put the walker objects in a native Python array. This is a 
very tempting approach; we can leave the entire problem 
of creating and killing walkers [23] to Python. Another 
approach is to make a walker array class in C++. This 
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way we can avoid explicit looping in the Python code, 
but we are again left to take care of varying array sizes 
in C++. 

To understand the first approach, we must have a look 
at how to put the walkers into a native array. To get a 
C++ class visible from Python it has to be compiled and 
linked into a shared library. This step is taken care of by 
the use of SWIG [3]. The walker array is then realized 
by the function warray, listing 1. 

A great advantage with Python is that you can expand 
a class in run-time (or in fact build an entire class in 
run-time). Utilizing this advantage, we have inserted the 
function warray into the class Walker where it naturally 
belongs, as can be seen in the function funcToMethod, 

I 



listing 2. This approach is particularly handy if we want 
to expand the Func class with new physical systems, en- 
abling us to write the new functors in pure Python code. 
However, as the functors are where most of the computa- 
tion time takes place, this approach will severely hinder 
the effectiveness of the simulation. 

In the native array approach most of the paralleliza- 
tion is realized with the functions spread_walkers and 
gather.walkers, listing 3. The functions walkcrs2py and 
py2walkers are functions for converting a walker to a 
NumPy array and back again, taking advantage of the 
functions gctBuffer and setBuffer in the Walker class. 
An example of a parallelized diffusion Monte Carlo pro- 
gram is realized in less than 100 lines: 



»> import pypar , math 

»> from DMC import DMC 

»> 

»> d = DMC(pypar) 

»> # 1 particle and alpha — 0.5 yields a harmonic oscillator with 

»> # energy == 3/2: 

»> d.params[0] = 0.5 

»> d . rcsct.params () 

»> d. silent = True # to avoid too much noise 
»> 

»> def timcstcpf i.stcp ): 

... M = d. no.of.walkcrs 

... d. spread.walkcrs () 

... for walker in d . w.block : 

... d. montc.carlo.stcp(walkcr) 

... d. gather.walkors () 

... d. update = False 

... i f d . master : 

... #bring out your dead 

foriinrange( M- 1.-1,-1): 
if d .w[ i ] . isDcad ( ) : 

... d.w[i:i + l] = [] ^removing walker 

... else: 

... while d.w[ i ] . tooAlivo () : 

. . . baby.walkcr = d. copy.walkcr (d ,w[ i ] ) 

... baby.walkcr . calmWalker () 

d.w += [baby.walkcr] 

d . w [ i ] . madcWalker ( ) 

... d. no.of.walkcrs = len(d.w) 

. . . d. no.of.walkcrs = d. pypar . broadcast (d. no.of.walkers , 

... d.mastcr.rank) 

... d. refresh.w.blocks () 

... d. spread.walkcrs () 

... d.num_args[ — 1] = d. update 

... i f d . master : 

... nrg = 0. ; pot.nrg = 0.; vort.nrg = 

... d.time.stcp.countcr += 1 

... for walker in d.w: 

... d.num_args[ — 1] = d. update 

... nrg -j-= walker . gctLocalEnergy ( * d . num.args) 

... pot.nrg += walker . getLocalEnergy (d. pot.e ) 

if hasattr (d, ' vort.e ' ) : 

... vort.nrg += walker . getLocalEnergy ( vort.e ) 

... nrg /= float (d.no_of_walkcrs*d. particles *d. scale) 

... pot.nrg /= float (d.no_of_walkcrs*d. particles *d. scale) 

... vort.nrg /= float (d.no_of_walkcrs*d. particles *d. scale) 

... d . c n c r g y += nrg 

... d.cnergy2 += n r g * n r g 

... if d.time.stcp.countcr >= d. termalization : 

... #implement p o s - hist to be called here 

... d. obscrvablcs [ i.stcp ,0] = nrg 

... d. obscrvablcs [ i.stcp , 1] = pot.nrg 

... d. obscrvablcs [ i.stcp ,2] = vort.nrg 

... # adjust trial energy (and no . of walkers) 

... nrg = — .5*math. log ( float (d . no.of.walkcrs )/ float (M)) / d . tau 

... d.c.trial += nrg 

... d.c.trial = d. pypar. broadcast (d.c.trial .d.mastcr.rank) 

. . . d. no.of.walkers = d. pypar . broadcast (d. no.of.walkcrs . 

... d.mastcr.rank) 

»> #set initial walker positions: 

»> if d.mctropolis.tcrmalization: d. uni.dist () 

»> for i in range (d. metropolis.tcrmalization ): 

... for walker in d. w.block: 

... d.metropolis.step(walker) 

»> 

»> nrg = 0. 

»> 

»> for walker in d. w.block: 

... nrg += walker . getLocalEnergy (*d. num.args) 

»> d. gather.walkers () 

»> d.c.trial = d.all.rcducc(nrg) 

»> d.time.stcp.countcr = 

»> for i in ran go (d. termalization): 

... timestep(i) 

»> d. energy = 0: d.cncrgy2 = 
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Listing 1: warray function 

def warray (self ,size , particles , dim ) : 

"""function returning an array of (initialized) walkers""" 

w= [] 

for i in range(size): 
w += [ Walker ( ) ] 

w[i].pylnitialize (particles , dim ) 
return w 



Listing 2: funcToMethod 

def funcToMethod ( func , clas , method_name=None ) : 

"""function to insert a python func into a class 
Taken from Python Cookbook, recipe 5.12 

setattr ( clas , method_name or func. name , func) 

funcToMethod (warray , Walker) ^insert function warray in class Walker 



»> for i in range (d. steps): 
... timestep(i) 

»> if d . master : 

... d. energy /= float (d. steps) 

... d.energy2 /= float (d. steps] 

... d.cncrgy2 — = d . energy *d . energy 

... print "energy = t% +/- Kg " %(d . energy , 

... math. sqrt(d.energy2 / float (d. steps))) 

... print "sigma = g " %d . c n c r g y 2 

energy = 1.5 +/- 
sigma = 

»> pypar . Finalize () 



D. Python implementation II 

When thinking performance of arrays in Python the add- 
on package Numerical Python (NumPy) springs to mind 
as an obvious choice. The fact that the module pypar 
(which we are going to use in the parallelization) supports 
sending NumPy arrays directly, is of course helping in 
that choice. However, even though NumPy supports a 
lot of types, (such as integers, floats, chars etc.) there is 
no support for walkers as a type [24]. It is possible to use 
generic python objects in NumPy arrays, but this will 
mainly make NumPy array comparable to native arrays. 
The approach with native python arrays is quite straight- 
forward and easy to implement. It is, however, quite inef- 
ficient as well. There are two main reasons for this. First, 
looping is known to be an inefficient construct in Python. 
With a native Python array, the loops over walkers have 
to be done in Python. Second, native python arrays are 
slower than, e.g., NumPy arrays, because native arrays 
are written for a much more general use than just nu- 
merics. The question is how we can use the most of the 
C++ walker code as-is while avoiding explicit for-loops 
in Python. 

One solution is to implement an array class (lets call it 
WalkerArray) in Python which is wrapper class to a C++ 
class containing a C++ array of walkers and functional- 
ity to create and kill walkers. Even though this is a good 
approach in a serial implementation, we still have to con- 
vert these arrays to NumPy arrays to be able to send the 



walkers in MPI. 

In our Python implementation, we keep the WalkerAr- 
ray, but store all the walker data in a NumPy array. To 
do this we have modified the C++ Walker class so that 
it only uses pointers to an array for all arrays and vari- 
ables that should be stored. This array is then provided 
by NumPy. Even though this approach taints the C++ 
implementation of the Walker class, we get the advan- 
tage that the Python DMC class only has to care about 
NumPy arrays, providing us with powerful tools for vec- 
torizing the Python code. 

1. WalkerArray 

To implement the class WalkerArray we need to have 
some knowledge of the C++ Walker class. The Walker 
class contains some arrays storing all particle positions 
and all previous particle positions and variables to know 
if the walker should be removed or duplicated. In addi- 
tion it stores the last computed local energy, quantum 
force and wave function to minimize the number of times 
we compute these quantities. In the C-l — h code this infor- 
mation is allocated and stored in each walker, making the 
creation of a walker rather costly. When we send walk- 
ers in MPI the information is collected from each walker 
and concatenated into one array before communication 
and inserted into the walkers after communication. Now, 
we want turn it the other way, i.e., we want to allocate 
and store all information from all walkers in one array 
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Listing 3: Spread and gather 

def spread_walkers ( self ) : 

"""Function converting walkers to numpy arrays, sending, 
recieving and converting back""" 
if self, master : 

displace = s e 1 f . loc .walkers [ s e 1 f . master .rank ] 
for i in range ( 1 , s e 1 f . numproc ) : 

send_w = self .w[ displace : disp lace+s e 1 f . loc_walkers [ i ] ] 
send_buff = walkers2py ( send_w ) 
self . pypar . send (send_buff , i ) 
displace += s e 1 f . loc .walkers [ i ] 

else : 

recv_buff = s e 1 f . pypar . receive ( s e 1 f . master _rank ) 

w_args = [ self . loc_walkers [ self . myrank] , self. particles ,\ 

self . dimensions ] 
self.w_block = py2walkers ( recv_buff , *w_args) 

def gat her .walkers ( s e 1 f ) : 

"""Look at spread_walker s in a mirror""" 
if self . master : 

displace = s e 1 f . loc .walkers [ s e 1 f . master .rank ] 
for i in range ( 1 , s e 1 f . numproc ) : 

recv.buff = s e 1 f . pypar . receive ( i ) 

w.args = [self, loc .walkers [i], self, particles, \ 

self . dimensions ] 
self .w[ displace : displace+s e 1 f . loc.walkers [ i ] ] — py2walkers(\ 

recv.buff , *w_args) 
displace += s e 1 f . loc .walkers [ i ] 

else : 

send.buff = walkers2py ( s e 1 f . w.block ) 

self . pypar .send(send_buff, self, master .rank) 



(preferably a NumPy array) and let the walkers operate 
on pointers to this array. This way the time to initialize 
a walker is reduced dramatically and all information on 
walkers are readily available from Python. This change 
of view for the Walker class is realized by changing all 
variables that define the walker to references to the cor- 
responding pointers to the NumPy array. 

V. PYTHON VS. C++ 

Now we know how to implement a Python version of 
a Monte Carlo solver. We then need to know if Mon- 
tcPython is efficient enough. We can assume that the 
Python implementation will never be faster that the cor- 
responding CH — h code, as Python will always have some 
degree of overhead just to access the C++ code. The 
question is how big this overhead may be. In figure 3 we 
have plotted the speedup of a Monte Carlo simulation as 
a function of the number of CPUs [25]. To the left we have 
done a relatively light simulation with 20 particles in 3 
dimensions using 480 walkers. The walkers are spread 
out evenly and communicated from the master node to 
the slave nodes and back again 3500 times so that all 
walkers are sent 7000 times. The size of one walker is in 
this case 1.2kB which means that for, e.g., 4 CPUs the 



size of each message is about 144kB. We can see that for 
this message size the overhead of using Python is almost 
none. 

To the right in figure 3 we have increased the number 
of walkers to 4800. However, we are also using a much 
higher number of CPUs. The message size for, e.g., 64 
CPUs is only 9kB. As we explicitly loop over the number 
of CPUs when we send walkers to and from the master, 
the overhead increases dramatically when compared to 
the time to send one message. The send and receive 
methods should therefore be vectorized with scatter and 
gather routines. Unfortunately we do not have uniform 
message sizes, making generic scatter and gather routines 
unusable. We will therefore have to write these routines 
ourselves. 

Python has similar speedup to C++, but the curve flat- 
tens out much faster as we increase the number of CPUs. 
As Monte Carlo simulations are known to have perfect 
speedup, we cannot be satisfied with the parallel algo- 
rithms of either the C++ version or the Python version. 

VI. OPTIMIZING MONTEPYTHON 

One of the points in using Python in scientific program- 
ming is that you can implement new and improved algo- 
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FIG. 3: Speedup of a simulation as a function of the number of CPUs used. In the left figure the serial run took about 30 
minutes and was run with an initial 480 walkers moved in 3500 time steps. In the right figure the serial run took about 4 hours 
30 minutes and was run with an initial 4800 walkers moved in 1750 time steps. 



rithms efficiently. We have seen that distribution of the 
random walkers over the compute nodes leads to a bot- 
tleneck due to communication when the number of CPUs 
grows large. This bottleneck is evident both for the C++ 
implementation and the Python implementation. In this 
section wc will improve the algorithm for load balancing 
of the walker in two ways. First, we will improve on the 
way the walkers are killed and reproduced. Second, we 
will improve on the load balancing itself by optimizing 
for heterogeneous clusters of CPUs. 

A. Distributing Walkers in Parallel 

The C++ Diffusion Monte Carlo application was origi- 
nally written in serial and then ported to parallel using 
MPI. In the serial version we used an algorithm where, 
in order to kill a walker, we moved the last walker in the 
sequence onto the walker that was to be killed and de- 
creased the number of walkers with one. To reproduce a 



walker, we copied the walker to the end of the sequence. 
This algorithm is optimal and widely used in serial Dif- 
fusion Monte Carlo. When we parallelized the code, we 
kept this serial algorithm by gathering all the walkers to 
the master node, let the master node do the killing and 
reproducing in serial, and then spread the walkers evenly 
among the slave nodes again. This way the load was 
always balanced, and the master had full control of the 
walkers at all times. The main problem is that, apart 
from memory issues as the master needs to store a lot 
of walkers, the serial work load for the master increases 
fast when we increase the number of CPUs in the calcu- 
lation. This problem is very clear in figure 3, where the 
speedup is quite poor already for 32 CPUs, both for the 
C++ version and the Python version. 
Again, the algorithm is best explained through the source 
code. First we let the slave nodes individually move 
their walkers and kill and reproduce their local walk- 
ers, then the function DMC.load_balancing(), balances 
the load: 



def load .balancing ( s e 1 f ) : 

"""Function balancing load between nodes""" 
self.tl = time.time() 

w_numbers = self, pypar. gather ( Numeric .array([len(self.w_block)]) , 

self . master_rank) 
tmp.w.numbers = copy . deepcopy ( w.numbers) 
w_numbers = s e 1 f . pypar . broadcast ( tmp.w.numbers , 

self . master.rank) 

s e 1 f . no.of .walkers = Numeric . sum ( w.numbers) 

self . ..find.opt.w.p.node () 

s e 1 f . f ir s t _b alanc e = False 

balanced = Numeric . array ( s e 1 f . loc .walker s ) 
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difference = w_numbers— balanced 

diff_sort = Numeric . argsort ( difference ) 
prev_i_min = diff_sort [0] 

while sum ( abs ( d i f f er enc e ) ) ! = : 

diff_sort = Numeric . argsort ( difference ) 
i.max = diff .sort [ — 1] 
i_min = diff_sort [0] 

if i_min = prev_i_min : 
i_min = diff_sort [1] 

if s e 1 f . myrank=i_max : 

self, pypar . send ( self . w_block [balanced [ i.max ] : ] , i_min ) 
args = [ balanced [ i.max ] , 
self, particles , 
self .dimensions , 

self .w_block [0: balanced [ i.max ] ] 
self.w_block = Walker Array . Walker Array (* args ) 
elif s e 1 f . myrank^=i_min : 

recv_buff = s e 1 f . pypar . rece ive ( i.max ) 

args = [ len ( s e 1 f . w_block)+ di f f er ence [ i.max ] , 

self, particles , 

self . dimensions , 

Numeric .concatenate((self.w_block [:] , recv_buff))] 
self.w_block = Walker Array . Walker Array (* args ) 
difference [ i_min]+=difference [i.max] 
difference [i_max] = 
prev_i_min = i_min 

I 



This function deserves some explanation. From line 5 to 
10 we update the current walker distribution and total 
number of walkers. In line 12 we determine the optimal 
distribution of walkers, to be explained in subsection 
VI B. At this point we know the actual distribution of 
walkers and the optimal distribution of walkers. The idea 
is then to find the length of the difference between the 
optimal and actual distribution and move walkers among 
nodes until the length, or the sum of the absolute value 
of the differences is is zero, see line 22. This is realized in 
lines 30-45 by moving the excess walkers from the node 
with maximum difference to the walker with minimum 
difference recursively. A problem with this procedure is 
that the the same node can have a minimum difference 
in subsequent cycles of the while- loop [26]. This leads to 
unnecessary waiting in the program. The remedy is seen 
in line 27 where we take the second minimum node if 
the minimum node is the same node as in the previous 
cycle. Of course this problem may just be transferred to 
the second minimum node, but this is much less likely to 
happen. 

It should be noted that this optimization does not pre- 
serve the result from the non-optimized code, in the sense 
that we will not get an identical output in the end. This is 
due to the fact that the random sequences are distributed 
per node and not per walker, meaning that each walker 
will get a different series of random numbers depending 
on which node it is sent to. The output will, however, be 



within the error range of the non-optimized code. 



B. Heterogeneous clusters 

Most new high performance clusters are more or less 
homogeneous, in the sense that the computation nodes 
have identical specifications with respect to CPU, RAM, 
network and storage. However, as a cluster usually ex- 
pands in time, due to more funds and need for more 
resources, it is very likely that it will become a heteroge- 
neous cluster [27]. Also, with the trend of multiple cores 
and CPUs per computation nodes, combined with the 
fact that there is more than one user per cluster, differ- 
ent nodes will have different (and possibly too high) load 
and therefore different computational speed. This means 
that even if a cluster is homogeneous on paper, it will 
act like a heterogeneous cluster in practice. If we want 
to gain the optimal performance from a cluster, we need 
to take into account this heterogeneity. 
In the function _f ind_opt_w_p_node () we use the time 
from the DMC class is initialized to the function is called 
to determine how the optimal distribution of walkers at 
every time step. The function itself can be seen in list- 
ing 4. To understand this function we just need some 
simple linear algebra. Say that we have set of walkers 
[xi,X2, ■ ■ - Xn] spread in an optimal way over N nodes. 
On node i the time to move one walker is given by Oj 
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Listing 4: Finding optimal walkers per node 

def find_opt_w_p_node ( self ) : 

"""Help function for load_balancing () 

Finds and returns the optimal number of walkers 

per node for a possibly non-uniform set of nodes 
it it it 

self.tl = time . time ( ) 

timings = self, pypar . gather ( Numeric . array ( [ abs( self . 1 1 — self . tO )] ) 

self . master_rank) 
tmp_timings = copy . deepcopy ( timings ) 
timings = s e 1 f . pypar . broadcast ( tmp_timings , 

self . master_rank) 

C = se If . no.of .walkers /sum ( 1 ./ timings ) 

tmp_loc_walkers = C/timings 

s e 1 f . loc .walkers = s e 1 f . NumericFloat 2Int List ( t mp_loc_walkers ) 
remainders = tmp_loc_walkers — s e 1 f . loc .walkers 

while sum( self . loc_walkers ) < s e 1 f . no.of .walkers : 
maxarg = Numeric . argmax ( remainders ) 
s e 1 f . loc _walker s [ maxarg ] += 1 
remainders [ maxarg ] = 

if self. master and s e 1 f . f i r s t _b al anc e : 
print timings 
print self . loc_walkers 

return s e 1 f . loc .walkers 



yielding a set [a\, a<2, ■ ■ ■ ffljv]- By optimal distribution we 
mean a distribution of walkers were each node finishes 
the work assigned to it between synchronizations at the 
same time C, i.e. aiXi — C . In addition we know the 
total number of walkers, T. We know that 



c 



(46) 



so that the problem reduces to finding C. However, as 
T = %i > w e have 



fact, the new version is more than twice as fast as the 
C++ version for 96 CPUs. 

These optimizations would of course be possible to do 
in the CH — h application as well, albeit in more than 75 
lines. However, the simple syntax in Python and the use 
of Numeric arrays to store walkers allow us to concentrate 
our effort directly on the optimization of the algorithm 
instead of dealing with, e.g., how to send, receive and 
concatenate a slice of walkers. 



C 



EuV 



(47) 



The zealous reader will notice that eq. (47) corresponds 
to line 13 of __f ind_opt_w_p_node () . The rest of the 
function is merely taking care of the fact that 
integers while dj and C are real numbers. 



C. Optimized results 

In figure 4 we show the speedup and walltime of the 
improved walker distribution compared to the non- 
optimized C++ and Python version from figure 3. We 
see that both the speedup and the walltime is much more 
reasonable for the optimized version of MontcPython. In 



VII. VISUALIZING WITH PYTHON 

We mentioned in the introduction that integration of 
simulation and visualization is an important feature of 
Matlab, Maple and others. This feature is maybe even 
more powerful in Python. In figures 5, 6 and 6 we have 
used pyVTK and Mayavi [18] to plot the particle density, 
which is an output of diffusion Monte Carlo. pyVTK [19] 
is a python interface to the Visualization ToolKit (VTK), 
while Mayavi, which is built on pyVTK, is a scriptablc 
graphic interface for 3D visualization. 
A signature of a Bose-Einstcin condensate is that it is 
irrotational. If we try to rotate the condensate, it will 
compensate by setting up quantum vortices along the 
rotational axis. Vortices is therefore crucial to the study 
of Bose-Einstein condensates. We need only small mod- 



16 




FIG. 4: The figures show the speedup (left figure) and walltime (right figure) for the simulation as a function of the number of 
CPUs used. The serial run took about 270 minutes and was run with an initial 4800 walkers moved in 1750 time steps. 




FIG. 5: The figures show where particles are detected. We 
see the expectation values in two spatial dimensions, the yz- 
plane to the left and the xy-plane to the right. The topmost 
figure corresponds to the ground state, while the bottom fig- 
ure corresponds to a state with one vortex in the center of the 
trap. 



ifications to the Hamiltonian (eq. (3)) and trial wave 
function (eq. (4)) to consider a single vortex along the 
z-axis in our system [7] . 

Figures 5 and 6 shows the change in the ground state 
when inserting the vortex. The repulsive nature of the 
vortex pushes the particles away from the z-axis, decreas- 
ing the maximum density when compared to the ground 
state. 




FIG. 6: The figures show where particles are detected. We 
see the expectation values for finding, from left to right, 1, 
2, 3 and 4 particles. The topmost figure corresponds to the 
ground state, while the bottom figure corresponds to a state 
with one vortex in the center of the trap. 



VIII. CONCLUSION 

We have implemented a Monte Carlo solver using three 
different approaches, from pure C-l — h, through a straight 
forward Python implementation, to an efficient, vector- 
ized Python implementation. Furthermore we have com- 
pared the C++ and the vectorized Python implementa- 
tions and shown that the overhead of using Python is 
non-existent for sufficiently large problems. In fact, with 
only 75 lines of code we were able to introduce an im- 
proved parallel algorithm for walker distribution, to gain 
a super-linear speedup when compared to C++. In ad- 
dition, we have shown that Python can be used directly 
as a visualization tool for rendering three dimensionally 
scientific visualizations. We can therefore safely say that 
Python serves as a powerful tool in scientific program- 
ming. 
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